Introduction

The aim of this project is to assist banks in assessing the credit risk of their loan applicants.

A bank makes the decision whether or not to grant an applicant a loan based on their profile. This exposes the bank to two types of risk:

  • If the bank declines the application of a person with a good credit risk, then it loses business in the form of the interest that the client would have paid.

  • If the bank approves the application of a person with a bad credit risk, then it loses the entirety of the loaned sum when that client defaults.

Our main goal is not only to minimize risk and maximize the bank's profit, but also to consider the fairness of our final model. This is relevant today because some existing credit scoring models appear to weigh gender significantly.

We will be using the German Credit Dataset, available from the UCI Machine Learning Repository.

This dataset contains information about 1000 credit applicants. Each applicant is classified as either Good or Bad as a reflection of their credit risk:

  • A Good applicant has a low credit risk and is thus worthy of a loan.
  • A Bad applicant has a high credit risk and should not be granted a loan.

We start by importing the libraries that we will be using; they can be installed by running the following code cell.

In [1]:
# !pip install imbalanced-learn>=0.6.1 lightgbm>=2.3.1 matplotlib>=3.1.2 numpy>=1.17.4 pandas>=0.25.3 plotly>=4.3.0 scikit-optimize>=0.5.2 seaborn>=0.9.0 scikit-learn>=0.22 xgboost>=0.90 fairlearn>=0.4.0 
In [2]:
import gc
from functools import partial
from copy import deepcopy
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import sklearn
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, make_scorer, confusion_matrix
from sklearn.inspection import permutation_importance

from imblearn.over_sampling import SMOTE

from fairlearn.postprocessing import ThresholdOptimizer

from skopt.space import Real, Integer
from skopt import gp_minimize

import matplotlib.pyplot as plt
from matplotlib import collections as mc
import seaborn as sns

# Custom created plotting functions
from plotting_functions import plot_variables, plot_pies, plot_importance_metrics, plot_confusion_matrix, plot_ROC, plot_votes

Exploratory Data Analysis

Loading the dataset

We start by taking a look at our dataset.

In [3]:
data = pd.read_csv("german.data", delimiter=' ', header=None)
data.head()
Out[3]:
0 1 2 3 4 5 6 7 8 9 ... 11 12 13 14 15 16 17 18 19 20
0 A11 6 A34 A43 1169 A65 A75 4 A93 A101 ... A121 67 A143 A152 2 A173 1 A192 A201 1
1 A12 48 A32 A43 5951 A61 A73 2 A92 A101 ... A121 22 A143 A152 1 A173 1 A191 A201 2
2 A14 12 A34 A46 2096 A61 A74 2 A93 A101 ... A121 49 A143 A152 1 A172 2 A191 A201 1
3 A11 42 A32 A42 7882 A61 A74 2 A93 A103 ... A122 45 A143 A153 1 A173 2 A191 A201 1
4 A11 24 A33 A40 4870 A61 A73 3 A93 A101 ... A124 53 A143 A153 2 A173 2 A191 A201 2

5 rows × 21 columns

Most of the categories have been coded, so in the following cell we use the supplied dataset description to decode them in order to get a good grasp of what the different variables and categories are.

In [4]:
header = "Existing checking account   Duration in month   Credit history   Purpose   Credit amount   Savings account/bonds   Present employment since   Installment rate   Personal status   Other debtors   Present residence since   Property   Age   Other installment plans   Housing   Existing credits   Job   Number of people being liable   Telephone   Foreign worker   Score"
categories = {"Existing checking account": {"A11": " ... < 0 DM", "A12": "0 <= ... < 200 DM", "A13": " ... >= 200 DM /salary assignments for at least 1 year", "A14": "no checking account"},
              "Credit history": {"A30": "no credits taken/all credits paid back duly", "A31": "all credits at this bank paid back duly", "A32": "existing credits paid back duly till now", "A33": "delay in paying off in the past", "A34": "critical account/other credits existing (not at this bank)"},
              "Purpose": {"A40": "car (new)", "A41": "car (used)", "A42": "furniture/equipment", "A43": "radio/television", "A44": "domestic appliances", "A45": "repairs", "A46": "education", "A48": "retraining", "A49": "business", "A410": "others"},
              "Savings account/bonds": {"A61": "... < 100 DM", "A62": " 100 <= ... < 500 DM", "A63": " 500 <= ... < 1000 DM", "A64": " ... >= 1000 DM", "A65": " unknown/ no savings account"},
              "Present employment since": {"A71": "unemployed", "A72": " ... < 1 year", "A73": "1 <= ... < 4 years ", "A74": "4 <= ... < 7 years", "A75": " ... >= 7 years"},
              "Personal status": {"A91": "divorced/separated", "A92": "undefined", "A93": "single", "A94": "married/widowed"},
              "Gender": {"A91": "male", "A92": "female", "A93": "male", "A94": "male"},
              "Other debtors": {"A101": "none", "A102": "co-applicant", "A103": "guarantor"},
              "Property": {"A121": "real estate", "A122": "building society savings agreement/life insurance", "A123": "car or other", "A124": "unknown / no property"},
              "Other installment plans": {"A141": "bank", "A142": "stores", "A143": "none"},
              "Housing": {"A151": "rent", "A152": "own", "A153": "for free"},
              "Job": {"A171": "unemployed/ unskilled - non-resident", "A172": "unskilled - resident", "A173": "skilled employee / official", "A174": "management/ self-employed/highly qualified employee/ officer"},
              "Telephone": {"A191": "no", "A192": "yes"},
              "Foreign worker": {"A201": "yes", "A202": "no"}}

data = pd.read_csv("german.data", delimiter=' ', names=header.split('   '))

# Add a Gender variable to separate it from Personal status
data.insert(8, "Gender", data["Personal status"])

# Encode categorical variables
for category in categories:
    data[category] = data[category].map(categories[category])

# Encode score
data["Score"] = data["Score"].map({1: "good", 2: "bad"})

n, p = data.shape
print("We have ", n, " observations of ", p, "variables")
print("Number of missing values : ",
      len(data[data.isnull().isin([True])].stack()))
data.head()
We have  1000  observations of  22 variables
Number of missing values :  0
Out[4]:
Existing checking account Duration in month Credit history Purpose Credit amount Savings account/bonds Present employment since Installment rate Gender Personal status ... Property Age Other installment plans Housing Existing credits Job Number of people being liable Telephone Foreign worker Score
0 ... < 0 DM 6 critical account/other credits existing (not a... radio/television 1169 unknown/ no savings account ... >= 7 years 4 male single ... real estate 67 none own 2 skilled employee / official 1 yes yes good
1 0 <= ... < 200 DM 48 existing credits paid back duly till now radio/television 5951 ... < 100 DM 1 <= ... < 4 years 2 female undefined ... real estate 22 none own 1 skilled employee / official 1 no yes bad
2 no checking account 12 critical account/other credits existing (not a... education 2096 ... < 100 DM 4 <= ... < 7 years 2 male single ... real estate 49 none own 1 unskilled - resident 2 no yes good
3 ... < 0 DM 42 existing credits paid back duly till now furniture/equipment 7882 ... < 100 DM 4 <= ... < 7 years 2 male single ... building society savings agreement/life insurance 45 none for free 1 skilled employee / official 2 no yes good
4 ... < 0 DM 24 delay in paying off in the past car (new) 4870 ... < 100 DM 1 <= ... < 4 years 3 male single ... unknown / no property 53 none for free 2 skilled employee / official 2 no yes bad

5 rows × 22 columns

Other than the convoluted category coding scheme, the dataset is well structured and has no missing values.

Univariate analysis

We start by looking at the distribution of the categorical and discrete variables in our dataset.

In [5]:
cat_disc_vars = [var for var in list(data) if var not in ("Age", "Duration in month", "Credit amount")]
plot_variables(cat_disc_vars, 0, data)

We do not notice any abnormal distributions.

However, we note by looking at the Score variable that 30% of the applicants are classified as Bad and 70% as Good. We will most likely have to correct this imbalance in order to improve the performance of our future models.
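This split is easy to verify; a minimal sketch, assuming a hypothetical Score series reproducing the same 70/30 proportions:

```python
import pandas as pd

# hypothetical Score column reproducing the 70/30 split observed in the data
score = pd.Series(["good"] * 7 + ["bad"] * 3)
proportions = score.value_counts(normalize=True)
print(proportions["good"])  # 0.7
```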

We also note by looking at the Gender variable that only 31% of the applicants are women, which is different from the 50%/50% split that we would expect to see. This may indicate another imbalance problem that, this time, relates to the fairness of our future models.

We now look at the distribution of the continuous variables.

In [6]:
plot_variables(["Age", "Credit amount", "Duration in month"], 1, data)

We notice that the three variables are positively skewed with a long tail, indicating, respectively, that most applicants are young, have received a small loan, and borrow for a duration of less than 24 months.

Bivariate analysis

We now look at the distribution of the Score variable across other categorical variables.

In [7]:
labels = [var for var in list(data) if var not in ("Age", "Duration in month", "Credit amount", "Purpose", "Score")]
plot_pies(labels, data, categories)

The distribution of the scores across most variables is close to what we would logically expect. A few variables like Credit history and Job seem counterintuitive; this may be because their category labels are mis-specified in the dataset description.

We can also note that variables like Number of people being liable, Present residence since and Telephone have a low variance across the different score classes, which may indicate poor predictive power compared to variables such as Existing checking account or Credit history.

We now look at how the continuous variables are distributed for each score class.

In [8]:
plot_variables(["Age", "Credit amount", "Duration in month"], 2, data)

The boxplots show that applications with an older applicant age, a lower credit amount and a shorter loan duration are less likely to be bad.

This is quite logical, and despite the differences between the two boxplots not being very large, we can expect these three variables to have significant predictive power.

Feature engineering

In this section we modify our data by adding new dummy variables and transforming some of the existing ones.

We start by log-transforming the Credit amount variable in order to reduce the scale difference across the variables.

In [9]:
data["Credit amount"] = np.log(data["Credit amount"])

Some of the categorical variables are actually ordered. We encode them accordingly.

In [10]:
ordinal_categories = {"Existing checking account": {" ... < 0 DM": 0, "0 <= ... < 200 DM": 2, " ... >= 200 DM /salary assignments for at least 1 year": 3, "no checking account": 1},
                      "Savings account/bonds": {"... < 100 DM": 1, " 100 <= ... < 500 DM": 2, " 500 <= ... < 1000 DM": 3, " ... >= 1000 DM": 4, " unknown/ no savings account": 0},
                      "Present employment since": {"unemployed": 0, " ... < 1 year": 1, "1 <= ... < 4 years ": 2, "4 <= ... < 7 years": 3, " ... >= 7 years": 4},
                      "Property": {"real estate": 3, "building society savings agreement/life insurance": 2, "car or other": 1, "unknown / no property": 0},
                      "Housing": {"rent": 0, "own": 2, "for free": 1},
                      "Job": {"unemployed/ unskilled - non-resident": 0, "unskilled - resident": 1, "skilled employee / official": 3, "management/ self-employed/highly qualified employee/ officer": 2},
                      "Score": {"bad": 0, "good": 1}}
for category in ordinal_categories:
    data[category] = data[category].map(ordinal_categories[category]).astype("int64")

With the ordinal variables correctly ordered, we can look at the correlation matrix of some of our variables.

In [11]:
corr = data.corr()
mask = np.zeros_like(corr, dtype=bool)  # np.bool is deprecated in recent NumPy
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f85b658c590>

We can see that the correlation coefficients are not particularly high, suggesting little multicollinearity. This removes the need to explore dimension reduction techniques such as PCA.

With the ordinal variables correctly encoded, we now one-hot encode the remaining categorical variables. This creates dummy variables for each category.

In [12]:
data = pd.get_dummies(data, drop_first=True)
# Drop redundant variable
data.drop('Personal status_undefined', axis=1, inplace=True) 
X = data.drop('Score', axis=1)
y = data['Score']

Model selection

In this section we test various baseline models. We will use 5-fold cross-validation to determine the best candidate models moving forward. "Best" is defined as having a low cost, which is defined in the following function.

In [13]:
# Define custom cost function for our problem
def cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    # we normalize the interest from a good credit to 1, and the loss from a credit default to 5
    max_gain = (tp + fn) * 1 # maximum gain = number of actual good credit applicants
    # TP generate an interest of 1, FP results in a loss of 5, and FN cause an opportunity cost of 1
    actual_gain = (tn * 0 + tp * 1 - fp * 5 - fn * 1)
    cost = max_gain - actual_gain
    return cost/max_gain  # Normalization 

costMetric = make_scorer(cost)
metrics = {'Cost': costMetric}
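As a sanity check, the cost definition can be exercised standalone on toy labels (the function is restated here so the snippet is self-contained): a perfect classifier costs 0, while approving every applicant incurs the full default losses.

```python
from sklearn.metrics import confusion_matrix

def cost(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    max_gain = (tp + fn) * 1  # maximum gain = number of actual good applicants
    actual_gain = (tn * 0 + tp * 1 - fp * 5 - fn * 1)
    return (max_gain - actual_gain) / max_gain

y_true = [1, 1, 0, 0]              # two good, two bad applicants
print(cost(y_true, [1, 1, 0, 0]))  # perfect predictions -> 0.0
print(cost(y_true, [1, 1, 1, 1]))  # approve everyone -> 5.0 (two defaults at 5 each)
```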

Unbalanced dataset

We start by testing the different models on the unbalanced dataset.

In [14]:
models = {'Logistic Regression': LogisticRegression(solver="liblinear"),
          'KNN': KNeighborsClassifier(),
          'Random Forest': RandomForestClassifier(),
          'SVM': SVC(),
          'XGBoost': XGBClassifier(),
          'Adaboost': AdaBoostClassifier(),
          'LightGBM': LGBMClassifier(),
          'Gradient Boosting': GradientBoostingClassifier()}

scoringMetric = "Cost"
results = pd.DataFrame(columns=models)
for model in models:
    kfold = KFold(shuffle=True, n_splits=5, random_state=0)
    cvResults = cross_val_score(models[model], X, y, cv=kfold, scoring=metrics[scoringMetric])
    results[model] = cvResults

bxplot = sns.boxplot(x="Model", y=scoringMetric, data=results.melt(var_name="Model", value_name=scoringMetric))
_ = bxplot.set_xticklabels(bxplot.get_xticklabels(), rotation=90)

Balanced dataset

We now test the same models on the balanced dataset

To balance the data, we use an oversampling method because the number of observations is limited and trying an undersampling method resulted in worse performance.

We choose to use SMOTE to generate examples for the underrepresented class. This algorithm selects an observation $X$ from this class, computes its $k$ nearest neighbors within the same class, selects one random neighbor $N$ from them, then generates a new observation at a random point on the segment between $X$ and $N$.
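A minimal NumPy sketch of this interpolation step, using a handful of made-up minority-class points (this is an illustration of the idea, not imblearn's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical minority-class observations
minority = np.array([[1.0, 1.0], [2.0, 2.0], [1.5, 0.5]])

def smote_sample(points, k=2):
    i = rng.integers(len(points))            # pick a random observation X
    x = points[i]
    dists = np.linalg.norm(points - x, axis=1)
    neighbours = np.argsort(dists)[1:k + 1]  # its k nearest minority neighbours
    n = points[rng.choice(neighbours)]       # pick one neighbour N at random
    u = rng.random()                         # random weight in [0, 1)
    return x + u * (n - x)                   # point on the segment [X, N]

synthetic = smote_sample(minority)
```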

We make sure to balance each fold and not the entire dataset in order to not inflate our scores.

In [15]:
models = {'Logistic Regression': LogisticRegression(solver="liblinear"),
          'KNN': KNeighborsClassifier(),
          'Random Forest': RandomForestClassifier(),
          'SVM': SVC(),
          'XGBoost': XGBClassifier(),
          'Adaboost': AdaBoostClassifier(),
          'LightGBM': LGBMClassifier(),
          'Gradient Boosting': GradientBoostingClassifier()}

scoringMetric = "Cost"
results = pd.DataFrame(columns=models)
for model in models:
    classifier = models[model]
    kfold = KFold(shuffle=True, n_splits=5, random_state=0)
    cvResults = []
    # Use SMOTE to balance each fold
    for train_idx, test_idx in kfold.split(X, y):
        X_train, y_train = X.iloc[train_idx], y[train_idx]
        X_test, y_test = X.iloc[test_idx], y[test_idx]
        X_train, y_train = SMOTE().fit_resample(X_train, y_train)
        X_train = pd.DataFrame(X_train, columns=X.columns)
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        cvResults.append(cost(y_test, y_pred))
    results[model] = cvResults

bxplot = sns.boxplot(x="Model", y=scoringMetric, data=results.melt(var_name="Model", value_name=scoringMetric))
_ = bxplot.set_xticklabels(bxplot.get_xticklabels(), rotation=90)

We can see that balancing the training data yields better results on the test set, as evidenced by the lower costs. Some models (Random Forest, XGBoost, LightGBM and Gradient Boosting) have lower costs than others, so we select these models moving forward.

We will also continue working with the balanced dataset henceforward.

Dataset splitting

Having identified the models with the most potential, we now split our dataset in a stratified fashion into Train, Validation and Test sets. We make sure not to balance our Test set in order not to bias our evaluations.

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.15,
                                                    shuffle=True,
                                                    random_state=0,
                                                    stratify=y)
# Rebalance training set
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)
X_train = pd.DataFrame(X_train, columns=X.columns)             
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                    test_size=0.15,
                                                    shuffle=True,
                                                    random_state=0,
                                                    stratify=y_train)

Feature selection

We have multiple variables that seem unimportant; this section aims to identify and remove them.

In [17]:
print("Number of features: ", X.shape[1])
Number of features:  35

We will evaluate the features according to a number of metrics: Chi2, R2, XGBoost importance and Random Forest importance. Each metric has its own merits as well as its caveats, which is why we average the four of them.

In [18]:
xgbcls = XGBClassifier(random_state=0)  # XGB
xgbcls.fit(X_train, y_train)
feat_importance = pd.DataFrame()
feat_importance["XGB"] = xgbcls.feature_importances_
rfcls = RandomForestClassifier(random_state=0)  # RF
rfcls.fit(X_train, y_train)
feat_importance["RF"] = rfcls.feature_importances_
feat_importance["Chi2"] = [(1-x)/10 for x in chi2(X_train, y_train)[1]]  # Chi2
# The metrics are scaled to be of the same order of magnitude
corr = [] # R2
for col in X_train.columns:
        corr.append(np.corrcoef(X_train[col], y_train)[0, 1]**2)
feat_importance["Correlation"] = corr
feat_importance["Average"] = feat_importance.mean(axis=1)
sorted_idx = feat_importance["Average"].argsort()
# Plotting the metrics
plot_importance_metrics(feat_importance, X_train.columns)

Most variables appear potentially useful, so we will only remove the 5 least important ones.

In [19]:
sorted_idx = feat_importance["Average"].argsort()
X_train.drop(X_train.columns[sorted_idx[:5]], axis=1, inplace=True)
X_val.drop(X_val.columns[sorted_idx[:5]], axis=1, inplace=True)
X_test.drop(X_test.columns[sorted_idx[:5]], axis=1, inplace=True)

Hyperparameter Optimization

In this section we improve our selected models by tuning their hyperparameters using Bayesian optimization. This technique, in a nutshell, is an efficient sequential strategy for global optimization of evaluation-expensive black-box functions that doesn't require derivatives. It treats the objective as a random function and after gathering the function evaluations, it updates the posterior distribution over the objective function which will be used to determine the query points for the next iteration. In this case the objective function to minimize is our custom-defined cost function, and the optimization space we will use is the cartesian product of intervals containing the default values of each hyperparameter.

In order to avoid biasing our final evaluations we are using a validation set for this tuning process.

We start by defining the optimise function, which is used to find the optimal parameters in a given search space for a given number of iterations.

We also define the evaluate_tuning function, which is used to compare the confusion matrices of a model before and after the tuning procedure.

In [20]:
def optimise(model, space, X_train, y_train, X_val, y_val, n_calls = 20):
    def model_assessment(args, X_train, y_train, X_val, y_val):
        # fit then evaluate the model on validation data then return its score
        params = {curr_model_hyper_params[i]: args[i] for i, j in enumerate(curr_model_hyper_params)}
        copy.set_params(**params)
        fitted_model = copy.fit(X_train, y_train)
        val_predictions = fitted_model.predict(X_val)
        val_score = cost(y_val, val_predictions)
        return(val_score)
    copy = deepcopy(model)
    curr_model_hyper_params = [i.name for i in space]
    # define the objective function as the cost given by the model on validation data
    objective_function = partial(model_assessment, X_train=X_train, y_train=y_train, X_val=X_val, y_val=y_val)
    # use bayesian optimization to minimize the objective function
    results = gp_minimize(objective_function, space, base_estimator=None, n_calls=n_calls, n_random_starts=20, random_state=0)
    # get the parameters that minimize the cost
    bestParams = dict(zip(curr_model_hyper_params, results.x))
    return copy.set_params(**bestParams)

def evaluate_tuning(untuned, tuned, name, X_train, y_train, X_test, y_test):
    # fit the tuned and untuned models to the train data then plot their confusion matrices
    fig, axes = plt.subplots(ncols=2, figsize=(16, 6))
    untuned.fit(X_train, y_train)
    tuned.fit(X_train, y_train)
    cm_untuned = confusion_matrix(y_test, untuned.predict(X_test))
    cm_tuned = confusion_matrix(y_test, tuned.predict(X_test))
    cost_untuned = cost(y_test, untuned.predict(X_test))
    cost_tuned = cost(y_test, tuned.predict(X_test))
    plot_confusion_matrix(cm_untuned, ax=axes[0], title=f"Untuned {name} (Cost:{cost_untuned:0.3f})")
    plot_confusion_matrix(cm_tuned, ax=axes[1], title=f"Tuned {name} (Cost:{cost_tuned:0.3f})")

Fine-tuning the XGBoost model

In [21]:
space = [Real(0.001, 0.3, name="gamma"),
         Real(0.05, 0.8, name="learning_rate"),
         Real(0.6, 5, name="max_delta_step"),
         Integer(3, 10, name="max_depth"),
         Real(0.2, 5, name="min_child_weight"),
         Integer(50, 200, name="n_estimators")]

xgbcls = XGBClassifier(random_state=0)
bestXGBoost = optimise(xgbcls, space, X_train, y_train, X_val, y_val, 50)
evaluate_tuning(xgbcls, bestXGBoost, "XGBoost", X_train, y_train, X_test, y_test)

We can see that the tuned model has a better confusion matrix and a lower cost.

Fine-tuning the Random Forest model

In [22]:
space = [Integer(2, 50, name="min_samples_split"),
         Integer(1, 50, name="min_samples_leaf"),
         Integer(3, 200, name="max_depth"),
         Integer(100, 300, name="n_estimators")]
rfcls = RandomForestClassifier(random_state=0)
bestRF = optimise(rfcls, space, X_train, y_train, X_val, y_val, 40)
evaluate_tuning(rfcls, bestRF, "Random Forest", X_train, y_train, X_test, y_test)

We can see that the tuned model has a better confusion matrix and a lower cost.

Fine-tuning the LightGBM model

In [23]:
space = [Real(0.04, 0.7, name="learning_rate"),
         Integer(0, 15, name="max_depth"),
         Integer(10, 50, name="num_leaves"),
         Real(0.001, 3, name="min_child_weight"),
         Integer(50, 300, name="n_estimators")]
lgbmcls = LGBMClassifier(random_state=0)
bestLGBM = optimise(lgbmcls, space, X_train, y_train, X_val, y_val, 50)
evaluate_tuning(lgbmcls, bestLGBM, "LightGBM", X_train, y_train, X_test, y_test)

We can see that the tuned model has a better confusion matrix and a lower cost.

Fine-tuning the Gradient Boosting model

In [24]:
space = [Integer(3, 10, name="min_samples_split"),
         Integer(5, 7, name="min_samples_leaf"),
         Integer(5, 8, name="max_depth"),
         Real(0.8, 0.9, name="subsample"),
         Integer(10, 15, name="max_features"),
         Integer(50, 100, name="n_estimators")]
gbcls = GradientBoostingClassifier(random_state=0)
bestGBoost = optimise(gbcls, space, X_train, y_train, X_val, y_val, 50)
evaluate_tuning(gbcls, bestGBoost, "Gradient Boosting", X_train, y_train, X_test, y_test)

We can see that the tuned model has a better confusion matrix and a lower cost. On average, tuning the models resulted in approximately 10% fewer misclassifications.

Ensemble model

Individual model evaluation

We start by looking at the ROC curves of the different models. They all have similar performance; no single model stands out enough to be chosen on its own.

In [25]:
classifiers = (bestRF,
               bestXGBoost,
               bestLGBM,
               bestGBoost)
result_table = pd.DataFrame(columns=['classifier', 'fpr', 'tpr', 'auc', 'cm', 'cost'])

for cls in classifiers:
    model = cls.fit(X_train, y_train)
    yproba = model.predict_proba(X_test)[::, 1]
    y_pred = model.predict(X_test)
    fpr, tpr, _ = roc_curve(y_test,  yproba)
    auc = roc_auc_score(y_test, yproba)
    cm = confusion_matrix(y_test, y_pred)
    cst = cost(y_test, y_pred)
    result_table = result_table.append({'classifier': cls.__class__.__name__,
                                        'fpr': fpr,
                                        'tpr': tpr,
                                        'auc': auc,
                                        'cm': cm,
                                        'cost': cst}, ignore_index=True)
result_table.set_index('classifier', inplace=True)
plot_ROC(result_table)

Ensemble model evaluation

We now create an ensemble out of all these models and see how it performs.

In [26]:
estimators = [('Random Forest', bestRF),
              ('XGBoost', bestXGBoost),
              ('LightGBM', bestLGBM),
              ('gradboost', bestGBoost)]
result_table.reset_index('classifier', inplace=True)

ensemble = VotingClassifier(estimators, voting='soft')
ensemble.fit(X_train, y_train)

yproba = ensemble.predict_proba(X_test)[::, 1]
yproba = pd.Series(yproba)
yproba.index = y_test.index
y_pred = ensemble.predict(X_test)
y_pred = pd.Series(y_pred)
y_pred.index = y_test.index
fpr, tpr, _ = roc_curve(y_test,  yproba)
auc = roc_auc_score(y_test, yproba)
cm = confusion_matrix(y_test, y_pred)
cst = cost(y_test, y_pred)
result_table = result_table.append({'classifier': ensemble.__class__.__name__,
                                    'fpr': fpr,
                                    'tpr': tpr,
                                    'auc': auc,
                                    'cm': cm,
                                    'cost': cst}, ignore_index=True)
result_table.set_index('classifier', inplace=True)
plot_ROC(result_table)

We can see that the ensemble model has a higher AUC than each individual model and a low cost. Its confusion matrix, shown below, is indeed better than that of each of the previous models.

In [27]:
plot_confusion_matrix(result_table['cm']['VotingClassifier'])
Out[27]:

Ensemble model analysis

We now look at the ensemble model in greater detail.

Since most of our individual models are based on boosting and bagging and are thus rather stable, we chose a soft voting ensembling technique, in which the predicted class-probability vectors of all models are summed and averaged.
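For a single sample, the averaging step reduces to the following (the probabilities are made up for illustration):

```python
import numpy as np

# hypothetical [P(bad), P(good)] outputs of three classifiers for one sample
probas = np.array([[0.40, 0.60],
                   [0.55, 0.45],
                   [0.30, 0.70]])

avg = probas.mean(axis=0)   # average the probability vectors
prediction = avg.argmax()   # 1, i.e. the "good" class
```

Note that the second classifier is outvoted: soft voting weighs confidence, not just the majority class label.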

The individual models are not always in sync. For example we visualize below the contribution of each classifier to the final predicted class of a random sample.

In [28]:
probas = [c.predict_proba(X_test) for c in (bestRF, bestXGBoost, bestLGBM, bestGBoost, ensemble)]
# get class probabilities for a sample
bad = [pr[59, 0] for pr in probas]
good = [pr[59, 1] for pr in probas]
plot_votes(bad, good, result_table.index.values)

We now look at which features are most important for the model by looking at the permutation feature importance. This metric is defined as the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature.
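A hand-rolled sketch of that definition on synthetic data (sklearn's permutation_importance does this with repeated shuffles): permuting the informative feature hurts the score far more than permuting the irrelevant one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)   # only feature 0 drives the target

model = LogisticRegression().fit(X, y)
base = model.score(X, y)

drops = []
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature-target link
    drops.append(base - model.score(Xp, y))
# drops[0] is large (feature 0 matters), drops[1] is near zero
```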

In [29]:
result = permutation_importance(ensemble, X_train, y_train, n_repeats=10,
                                random_state=42, n_jobs=2)
sorted_idx = result.importances_mean.argsort()

fig, ax = plt.subplots(figsize=(5,10), dpi=200)
ax.boxplot(result.importances[sorted_idx].T,
           vert=False, labels=X_train.columns[sorted_idx])
ax.set_title("Permutation Importance (train set)")
fig.tight_layout()
plt.show()

We can see that the most important features relate to the credit amount, duration and the background of the applicant. Fortunately, the gender of the applicant (the dummy variable Gender_male) ranks low on the importance scale, which may be indicative of a lack of discrimination based on gender. We will, however, investigate this further in the following section.

Model fairness analysis

We start by defining the evaluate_groups function which is used to calculate different metrics for each sub-group of applicants.

Demographic Parity states that each gender should receive the positive outcome at an equal rate, i.e.

$$P(Good|Gender=Male) = P(Good|Gender=Female)$$

In practice, we may not require the difference in Positive Rates to be exactly 0, but we will aim to minimise this gap.
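On toy predictions, the gap in question is simply the difference in positive rates between the groups (the numbers here are made up):

```python
import numpy as np

# hypothetical predictions and a binary sensitive attribute (1 = male)
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 0])
gender = np.array([1, 1, 1, 1, 0, 0, 0, 0])

pos_rate_male = y_pred[gender == 1].mean()     # 0.75
pos_rate_female = y_pred[gender == 0].mean()   # 0.25
dp_gap = abs(pos_rate_male - pos_rate_female)  # 0.5
```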

Equalised Odds requires the positive outcome to be independent of the gender, conditional on the actual reference class, i.e.

$$P(Prediction=Good|Gender=Male,Reference=class) = P(Prediction=Good|Gender=Female, Reference=class)\\ class \in \{Good, Bad\}$$

Based on the confusion matrix, we may not require the True Positive Rate and False Positive Rate to be the same for each gender but we will aim to minimise both gaps.

In [30]:
def evaluate_groups(y_test, y_pred, indexes, names, yproba=pd.Series()):
    """Compute per-group performance and fairness metrics."""
    group_results = pd.DataFrame(columns=['group', 'fpr', 'tpr', 'auc', 'cm', 'cost'])

    for group_idx, group_name in zip(indexes, names):
        y_test_group = y_test[group_idx]
        y_pred_group = y_pred[group_idx]
        if not yproba.empty:
            # Probability scores are needed for the ROC curve and the AUC
            yproba_group = yproba[group_idx]
            fpr, tpr, _ = roc_curve(y_test_group, yproba_group)
            auc = roc_auc_score(y_test_group, yproba_group)
        else:
            fpr, tpr, auc = 0, 0, 0
        cm = confusion_matrix(y_test_group, y_pred_group)
        tn, fp, fn, tp = cm.ravel()
        cst = cost(y_test_group, y_pred_group)
        eo = tp / (tp + fn)                    # True Positive Rate (Equalised Odds)
        pdp = (tp + fp) / (tp + fp + tn + fn)  # Positive Rate (Demographic Parity)
        group_results = group_results.append({'group': group_name,
                                              'fpr': fpr,
                                              'tpr': tpr,
                                              'auc': auc,
                                              'cm': cm,
                                              'cost': cst,
                                              'equalized_odds': eo,
                                              'prop_demo_parity': pdp}, ignore_index=True)
    group_results.set_index('group', inplace=True)
    return group_results
In [31]:
male_idx = X_test[X_test["Gender_male"]==1].index.values
female_idx = X_test[X_test["Gender_male"]==0].index.values
group_evaluations = evaluate_groups(y_test, y_pred, [male_idx, female_idx], ["Male", "Female"], yproba)

We can look at the confusion matrix of each gender, as well as their equalized odds and proportional demographic parity.

In [32]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 6))
plot_confusion_matrix(group_evaluations["cm"][0], ax=axes[0], title=f'Male (Cost: {group_evaluations["cost"][0]:0.3f})')
plot_confusion_matrix(group_evaluations["cm"][1], ax=axes[1], title=f'Female (Cost: {group_evaluations["cost"][1]:0.3f})')
In [33]:
print(group_evaluations[["equalized_odds", "prop_demo_parity"]])
        equalized_odds  prop_demo_parity
group                                   
Male          0.961039          0.801887
Female        0.714286          0.613636

We start by creating a wrapper class for our ensemble model. This new classifier outputs probabilities instead of class labels.

In [34]:
class EnsembleClassifier:
    """Wrapper that exposes the ensemble's Good-class probability via predict()."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)

    def predict(self, X):
        # Return P(Good) instead of a hard class label
        probabilities = self.estimator.predict_proba(X)[:, 1]
        return probabilities

Instead of outputting a class whenever its probability exceeds 0.5, we choose a different threshold for each gender. The thresholds are chosen so as to minimize the gap in Demographic Parity, which will in turn reduce the gap in Equalized Odds.
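The idea of group-specific thresholds can be illustrated with a small sketch (the threshold values and the helper name are hypothetical; the actual thresholds are chosen by fairlearn's `ThresholdOptimizer` below):

```python
import numpy as np

def predict_with_group_thresholds(proba, gender, thr_male=0.55, thr_female=0.45):
    """Apply a separate Good/Bad cutoff to each gender's predicted probability."""
    proba, gender = np.asarray(proba), np.asarray(gender)
    thresholds = np.where(gender == 1, thr_male, thr_female)
    return (proba >= thresholds).astype(int)

# The same 0.50 score passes the female threshold but not the male one
proba  = [0.60, 0.50, 0.50, 0.40]
gender = [1,    1,    0,    0]
print(predict_with_group_thresholds(proba, gender))  # [1 0 1 0]
```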

In [35]:
ensembleWrapper = EnsembleClassifier(ensemble)
postprocessedEstimator = ThresholdOptimizer(
    unconstrained_predictor=ensembleWrapper,
    constraints="demographic_parity")

postprocessedEstimator._plot = True
postprocessedEstimator.fit(X_train, y_train, sensitive_features=X_train["Gender_male"])

ypred_new = postprocessedEstimator.predict(X_test, sensitive_features=X_test["Gender_male"])
ypred_new = pd.Series(ypred_new)
ypred_new.index = y_test.index

group_evaluations = evaluate_groups(y_test, ypred_new, [male_idx, female_idx], ["Male", "Female"])

Looking at the confusion matrix of each gender, we can see that the performance of the model has degraded: the model now classifies women as Good more often, which results in more False Positives.

In [36]:
fig, axes = plt.subplots(ncols=2, figsize=(16, 6))
plot_confusion_matrix(group_evaluations["cm"][0], ax=axes[0], title=f'Male (Cost: {group_evaluations["cost"][0]:0.3f})')
plot_confusion_matrix(group_evaluations["cm"][1], ax=axes[1], title=f'Female (Cost: {group_evaluations["cost"][1]:0.3f})')

Despite this degraded performance, the model now treats both genders more fairly: the gaps in both metrics are significantly smaller than before.

In [37]:
print(group_evaluations[["equalized_odds", "prop_demo_parity"]])
        equalized_odds  prop_demo_parity
group                                   
Male          0.935065          0.783019
Female        0.785714          0.750000

Conclusion

In conclusion, machine learning offers genuine insight into credit risk assessment: achieving an AUC of ~84% on a problem that is difficult for humans and rule-based systems demonstrates as much.

Our ensembling technique improved performance, and since it is based on probabilities, the model can be analyzed and improved, as evidenced by our ex-post fairness-enhancing technique.

Using scikit-learn means that our model can easily be fitted and applied to new data. The model achieves results good enough to be used in a business environment, although training it on more data should improve its performance. In such a use case, online learning isn't necessary, since the measure of creditworthiness doesn't change over time. The final model is scalable and can quickly be fitted to datasets 10-100 times larger. Re-tuning the model takes somewhat more time, but it remains reasonable in a batch-learning context.

Regarding fairness, there is clearly a trade-off between achieving the best model performance (and hence the most profit) and achieving greater fairness. Nonetheless, it is definitely possible to build a model that does not discriminate from the ground up.

Other techniques such as neural networks might add some value, but we doubt they would be effective in this case given the small size of our dataset.